About Dataset

Description: Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area.

The key challenge in detecting it is classifying tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete this classification analysis using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.

Acknowledgements: This dataset was obtained from Kaggle. Link: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset

Problem Statement: Breast Cancer Classification using Support Vector Machine (SVM)

Breast cancer is one of the most common and life-threatening diseases affecting women worldwide. Early diagnosis significantly improves the chances of successful treatment and recovery. However, manual diagnosis based on biopsy or imaging can be time-consuming and subject to human error.

This project aims to develop a binary classification model using Support Vector Machine (SVM) to automatically distinguish between malignant and benign breast tumors using the Breast Cancer Wisconsin dataset. The goal is to create a reliable, efficient, and accurate model that can assist healthcare professionals in decision-making.

Objectives:

Load and preprocess the Breast Cancer dataset to make it suitable for binary classification.

Train SVM classifiers using both linear and RBF kernels.

Visualize the decision boundaries using 2D projection of the data.

Tune hyperparameters like C and gamma using GridSearchCV for better performance.

Evaluate model performance using cross-validation, a confusion matrix, and a classification report.


Importing Required Libraries¶

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

from sklearn.svm import SVC
from sklearn.decomposition import PCA

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

Import the Breast Cancer Dataset¶

In [29]:
breast_dataset = pd.read_csv("breast-cancer.csv")
In [30]:
print("Top 10 rows :\n")
breast_dataset.head(10)
Top 10 rows :

Out[30]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678
5 843786 M 12.45 15.70 82.57 477.1 0.12780 0.17000 0.15780 0.08089 ... 15.47 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440
6 844359 M 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.11270 0.07400 ... 22.88 27.66 153.20 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368
7 84458202 M 13.71 20.83 90.20 577.9 0.11890 0.16450 0.09366 0.05985 ... 17.06 28.14 110.60 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.11510
8 844981 M 13.00 21.82 87.50 519.8 0.12730 0.19320 0.18590 0.09353 ... 15.49 30.73 106.20 739.3 0.1703 0.5401 0.5390 0.2060 0.4378 0.10720
9 84501001 M 12.46 24.04 83.97 475.9 0.11860 0.23960 0.22730 0.08543 ... 15.09 40.68 97.65 711.4 0.1853 1.0580 1.1050 0.2210 0.4366 0.20750

10 rows × 32 columns

In [4]:
Total_Rows = breast_dataset.shape[0]
Total_Cols = breast_dataset.shape[1]
print("Total Rows is :",Total_Rows)
print("Total columns is :", Total_Cols)
Total Rows is : 569
Total columns is : 32
In [5]:
print( "\n Information about the breast cancer dataset : \n " )
breast_dataset.info()
 Information about the breast cancer dataset : 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB

Columns distribution¶

In [6]:
feature_cols =  breast_dataset.columns.tolist()
print("Total columns is here :\n", breast_dataset.columns.tolist())
Total columns is here :
 ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
In [7]:
numerical_col = breast_dataset.select_dtypes(include=["int64","float64"]).columns
print("Total Numerical columns list is here :\n", numerical_col)
print("\nTotal Numerical columns is here :\n", len(numerical_col))  # count the columns directly
Total Numerical columns list is here :
 Index(['id', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

Total Numerical columns is here :
 31
In [8]:
categorical_col = breast_dataset.select_dtypes(include=["O"]).columns
print("Total categorical columns  list is here :\n", categorical_col)
print("\nTotal categorical columns is here :\n", len(categorical_col))  # count the columns directly
Total categorical columns  list is here :
 Index(['diagnosis'], dtype='object')

Total categorical columns is here :
 1

Basic statistics about the breast cancer data:¶

In [9]:
breast_dataset.describe().T
Out[9]:
count mean std min 25% 50% 75% max
id 569.0 3.037183e+07 1.250206e+08 8670.000000 869218.000000 906024.000000 8.813129e+06 9.113205e+08
radius_mean 569.0 1.412729e+01 3.524049e+00 6.981000 11.700000 13.370000 1.578000e+01 2.811000e+01
texture_mean 569.0 1.928965e+01 4.301036e+00 9.710000 16.170000 18.840000 2.180000e+01 3.928000e+01
perimeter_mean 569.0 9.196903e+01 2.429898e+01 43.790000 75.170000 86.240000 1.041000e+02 1.885000e+02
area_mean 569.0 6.548891e+02 3.519141e+02 143.500000 420.300000 551.100000 7.827000e+02 2.501000e+03
smoothness_mean 569.0 9.636028e-02 1.406413e-02 0.052630 0.086370 0.095870 1.053000e-01 1.634000e-01
compactness_mean 569.0 1.043410e-01 5.281276e-02 0.019380 0.064920 0.092630 1.304000e-01 3.454000e-01
concavity_mean 569.0 8.879932e-02 7.971981e-02 0.000000 0.029560 0.061540 1.307000e-01 4.268000e-01
concave points_mean 569.0 4.891915e-02 3.880284e-02 0.000000 0.020310 0.033500 7.400000e-02 2.012000e-01
symmetry_mean 569.0 1.811619e-01 2.741428e-02 0.106000 0.161900 0.179200 1.957000e-01 3.040000e-01
fractal_dimension_mean 569.0 6.279761e-02 7.060363e-03 0.049960 0.057700 0.061540 6.612000e-02 9.744000e-02
radius_se 569.0 4.051721e-01 2.773127e-01 0.111500 0.232400 0.324200 4.789000e-01 2.873000e+00
texture_se 569.0 1.216853e+00 5.516484e-01 0.360200 0.833900 1.108000 1.474000e+00 4.885000e+00
perimeter_se 569.0 2.866059e+00 2.021855e+00 0.757000 1.606000 2.287000 3.357000e+00 2.198000e+01
area_se 569.0 4.033708e+01 4.549101e+01 6.802000 17.850000 24.530000 4.519000e+01 5.422000e+02
smoothness_se 569.0 7.040979e-03 3.002518e-03 0.001713 0.005169 0.006380 8.146000e-03 3.113000e-02
compactness_se 569.0 2.547814e-02 1.790818e-02 0.002252 0.013080 0.020450 3.245000e-02 1.354000e-01
concavity_se 569.0 3.189372e-02 3.018606e-02 0.000000 0.015090 0.025890 4.205000e-02 3.960000e-01
concave points_se 569.0 1.179614e-02 6.170285e-03 0.000000 0.007638 0.010930 1.471000e-02 5.279000e-02
symmetry_se 569.0 2.054230e-02 8.266372e-03 0.007882 0.015160 0.018730 2.348000e-02 7.895000e-02
fractal_dimension_se 569.0 3.794904e-03 2.646071e-03 0.000895 0.002248 0.003187 4.558000e-03 2.984000e-02
radius_worst 569.0 1.626919e+01 4.833242e+00 7.930000 13.010000 14.970000 1.879000e+01 3.604000e+01
texture_worst 569.0 2.567722e+01 6.146258e+00 12.020000 21.080000 25.410000 2.972000e+01 4.954000e+01
perimeter_worst 569.0 1.072612e+02 3.360254e+01 50.410000 84.110000 97.660000 1.254000e+02 2.512000e+02
area_worst 569.0 8.805831e+02 5.693570e+02 185.200000 515.300000 686.500000 1.084000e+03 4.254000e+03
smoothness_worst 569.0 1.323686e-01 2.283243e-02 0.071170 0.116600 0.131300 1.460000e-01 2.226000e-01
compactness_worst 569.0 2.542650e-01 1.573365e-01 0.027290 0.147200 0.211900 3.391000e-01 1.058000e+00
concavity_worst 569.0 2.721885e-01 2.086243e-01 0.000000 0.114500 0.226700 3.829000e-01 1.252000e+00
concave points_worst 569.0 1.146062e-01 6.573234e-02 0.000000 0.064930 0.099930 1.614000e-01 2.910000e-01
symmetry_worst 569.0 2.900756e-01 6.186747e-02 0.156500 0.250400 0.282200 3.179000e-01 6.638000e-01
fractal_dimension_worst 569.0 8.394582e-02 1.806127e-02 0.055040 0.071460 0.080040 9.208000e-02 2.075000e-01
In [10]:
breast_dataset.describe(include="all").T
Out[10]:
count unique top freq mean std min 25% 50% 75% max
id 569.0 NaN NaN NaN 30371831.432337 125020585.612224 8670.0 869218.0 906024.0 8813129.0 911320502.0
diagnosis 569 2 B 357 NaN NaN NaN NaN NaN NaN NaN
radius_mean 569.0 NaN NaN NaN 14.127292 3.524049 6.981 11.7 13.37 15.78 28.11
texture_mean 569.0 NaN NaN NaN 19.289649 4.301036 9.71 16.17 18.84 21.8 39.28
perimeter_mean 569.0 NaN NaN NaN 91.969033 24.298981 43.79 75.17 86.24 104.1 188.5
area_mean 569.0 NaN NaN NaN 654.889104 351.914129 143.5 420.3 551.1 782.7 2501.0
smoothness_mean 569.0 NaN NaN NaN 0.09636 0.014064 0.05263 0.08637 0.09587 0.1053 0.1634
compactness_mean 569.0 NaN NaN NaN 0.104341 0.052813 0.01938 0.06492 0.09263 0.1304 0.3454
concavity_mean 569.0 NaN NaN NaN 0.088799 0.07972 0.0 0.02956 0.06154 0.1307 0.4268
concave points_mean 569.0 NaN NaN NaN 0.048919 0.038803 0.0 0.02031 0.0335 0.074 0.2012
symmetry_mean 569.0 NaN NaN NaN 0.181162 0.027414 0.106 0.1619 0.1792 0.1957 0.304
fractal_dimension_mean 569.0 NaN NaN NaN 0.062798 0.00706 0.04996 0.0577 0.06154 0.06612 0.09744
radius_se 569.0 NaN NaN NaN 0.405172 0.277313 0.1115 0.2324 0.3242 0.4789 2.873
texture_se 569.0 NaN NaN NaN 1.216853 0.551648 0.3602 0.8339 1.108 1.474 4.885
perimeter_se 569.0 NaN NaN NaN 2.866059 2.021855 0.757 1.606 2.287 3.357 21.98
area_se 569.0 NaN NaN NaN 40.337079 45.491006 6.802 17.85 24.53 45.19 542.2
smoothness_se 569.0 NaN NaN NaN 0.007041 0.003003 0.001713 0.005169 0.00638 0.008146 0.03113
compactness_se 569.0 NaN NaN NaN 0.025478 0.017908 0.002252 0.01308 0.02045 0.03245 0.1354
concavity_se 569.0 NaN NaN NaN 0.031894 0.030186 0.0 0.01509 0.02589 0.04205 0.396
concave points_se 569.0 NaN NaN NaN 0.011796 0.00617 0.0 0.007638 0.01093 0.01471 0.05279
symmetry_se 569.0 NaN NaN NaN 0.020542 0.008266 0.007882 0.01516 0.01873 0.02348 0.07895
fractal_dimension_se 569.0 NaN NaN NaN 0.003795 0.002646 0.000895 0.002248 0.003187 0.004558 0.02984
radius_worst 569.0 NaN NaN NaN 16.26919 4.833242 7.93 13.01 14.97 18.79 36.04
texture_worst 569.0 NaN NaN NaN 25.677223 6.146258 12.02 21.08 25.41 29.72 49.54
perimeter_worst 569.0 NaN NaN NaN 107.261213 33.602542 50.41 84.11 97.66 125.4 251.2
area_worst 569.0 NaN NaN NaN 880.583128 569.356993 185.2 515.3 686.5 1084.0 4254.0
smoothness_worst 569.0 NaN NaN NaN 0.132369 0.022832 0.07117 0.1166 0.1313 0.146 0.2226
compactness_worst 569.0 NaN NaN NaN 0.254265 0.157336 0.02729 0.1472 0.2119 0.3391 1.058
concavity_worst 569.0 NaN NaN NaN 0.272188 0.208624 0.0 0.1145 0.2267 0.3829 1.252
concave points_worst 569.0 NaN NaN NaN 0.114606 0.065732 0.0 0.06493 0.09993 0.1614 0.291
symmetry_worst 569.0 NaN NaN NaN 0.290076 0.061867 0.1565 0.2504 0.2822 0.3179 0.6638
fractal_dimension_worst 569.0 NaN NaN NaN 0.083946 0.018061 0.05504 0.07146 0.08004 0.09208 0.2075
In [11]:
breast_dataset.describe(include="O").T
Out[11]:
count unique top freq
diagnosis 569 2 B 357

Visualizations:¶

In [12]:
sns.pairplot(breast_dataset)
plt.tight_layout()
plt.show()
[Figure: pair plot of all dataset columns]
In [13]:
correlation = breast_dataset.corr(numeric_only=True)

plt.figure(figsize=(43,35))
sns.set(font_scale=3.0)
sns.heatmap(correlation, linewidths=4, annot=True, cmap="coolwarm",
            annot_kws={"size": 16}, fmt=".2f")
plt.title("Correlation Matrix", fontsize=28)
plt.tight_layout()
plt.show()
[Figure: annotated correlation heatmap of the numerical columns]
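A heatmap with this many annotated cells is hard to read, so it can help to list the strongest correlations as a table. The following is a minimal sketch rather than part of the original analysis: `top_correlated_pairs` is a hypothetical helper, the 0.9 threshold is an arbitrary choice, and the data is synthetic stand-in data, not the real measurements.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in data (assumption: not the real dataset).
rng = np.random.default_rng(0)
radius = rng.normal(14, 3.5, size=200)
demo = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0, 0.5, size=200),
    "texture_mean": rng.normal(19, 4, size=200),
})

strong_pairs = top_correlated_pairs(demo)
print(strong_pairs)
```

On the real data the same helper could be called as `top_correlated_pairs(breast_dataset)`. Strongly correlated groups such as radius, perimeter, and area are candidates for dropping or for dimensionality reduction.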
In [14]:
for col in numerical_col:
    plt.figure(figsize=(6,2))
    sns.histplot(x=breast_dataset[col],color="gold",kde=True ,bins=30)
    plt.title(f"Histogram plot of {col}")
    plt.show()
[Figures: 31 histograms with KDE overlays, one per numerical column]

Preprocessing:¶

1. Handle null values:¶

In [15]:
breast_dataset.isna().sum()
Out[15]:
id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

The output clearly shows that there are no missing values in the dataset.

2. Handle categorical values:¶

In [16]:
lb = LabelEncoder()
# LabelEncoder sorts the class labels, so "B" (benign) becomes 0 and "M" (malignant) becomes 1.
breast_dataset.diagnosis = lb.fit_transform(breast_dataset.diagnosis)
In [17]:
breast_dataset.diagnosis[15:30]
Out[17]:
15    1
16    1
17    1
18    1
19    0
20    0
21    0
22    1
23    1
24    1
25    1
26    1
27    1
28    1
29    1
Name: diagnosis, dtype: int32
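To confirm which integer stands for which diagnosis, the encoder's `classes_` attribute can be inspected. A small sketch on a toy series (the four labels below are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption: illustrative values only).
labels = pd.Series(["M", "B", "B", "M"], name="diagnosis")

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so position 0 is "B" and position 1 is "M".
code_for = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(code_for)

# An explicit mapping yields the same codes and documents the intent in the code itself.
explicit = labels.map({"B": 0, "M": 1})
print(explicit.tolist())
```

An explicit `map` is often preferable for a binary target because the meaning of 0 and 1 does not depend on alphabetical order.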

3. Detect and Remove Outliers¶

In [18]:
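# NOTE: this loop visits every column, including `id` and the encoded `diagnosis`,
# and each pass filters the frame left by the previous pass, so the quartiles are
# recomputed on progressively fewer rows. Only 233 of the 569 rows survive
# (see the class counts further down).
for i in breast_dataset:
    q1 = breast_dataset[i].quantile(0.25)
    q3 = breast_dataset[i].quantile(0.75)

    iqr = q3 - q1

    # Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers.
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    breast_dataset = breast_dataset[(breast_dataset[i] >= lower_bound) & (breast_dataset[i] <= upper_bound)]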
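An alternative worth considering is to compute the IQR bounds once, on the measurement columns only, and to check the class counts before and after filtering. The sketch below is an illustration under assumptions, not the method used above: `iqr_mask` is a hypothetical helper and the data is synthetic, with the minority class deliberately given extreme values.

```python
import numpy as np
import pandas as pd

def iqr_mask(frame, columns, k=1.5):
    """Boolean mask of rows inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    inside = (frame[columns] >= q1 - k * iqr) & (frame[columns] <= q3 + k * iqr)
    return inside.all(axis=1)

# Synthetic stand-in: 95 ordinary rows plus 5 extreme rows in the minority class.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": np.arange(100),
    "diagnosis": [0] * 95 + [1] * 5,
    "radius_mean": np.r_[rng.normal(12, 1, 95), np.full(5, 30.0)],
})

# Bounds come from the unfiltered frame; identifier and label are left out.
feature_cols = [c for c in demo.columns if c not in ("id", "diagnosis")]
mask = iqr_mask(demo, feature_cols)
filtered = demo[mask]

print("before:", demo["diagnosis"].value_counts().to_dict())
print("after: ", filtered["diagnosis"].value_counts().to_dict())
```

In this toy example the filter removes every row of class 1. When the rows flagged as outliers belong mostly to one class, dropping them discards exactly the cases the classifier needs to learn, so comparing class counts before and after is a cheap sanity check.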
    
In [25]:
for col in breast_dataset:
    plt.figure(figsize=(6, 2))
    sns.boxplot(x=breast_dataset[col], color="green")
    plt.title(f"Boxplot of {col}")
    plt.tight_layout()
    plt.show()

import warnings 
warnings.simplefilter("ignore")
[Figures: 32 boxplots, one per remaining column]

Feature Engineering:¶

In [20]:
# NOTE: x still contains the `id` column, which is an identifier rather than a measurement.
x = breast_dataset.drop("diagnosis", axis=1)
y = breast_dataset.diagnosis

x_train,x_test,y_train,y_test = train_test_split(x ,y, test_size=0.2 , random_state=42)
In [21]:
y.value_counts()
Out[21]:
diagnosis
0    219
1     14
Name: count, dtype: int64
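With only 14 malignant samples out of 233, a plain random split can leave very few positives in the test set. One common option is the `stratify` argument of `train_test_split`, which keeps the class ratio roughly equal in both parts. A minimal sketch using synthetic labels with the counts shown above (the feature column is a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts shown above: 219 benign (0), 14 malignant (1).
y_demo = np.array([0] * 219 + [1] * 14)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not real data

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train size:", len(y_tr), "malignant share:", y_tr.mean().round(3))
print("test size: ", len(y_te), "malignant share:", y_te.mean().round(3))
```

Stratification does not fix the imbalance itself, but it makes the test metrics for the minority class less dependent on the luck of the split.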
In [22]:
target_var = y.map({0: "Benign", 1: "Malignant"}).value_counts()

plt.bar(target_var.index, target_var.values, color="hotpink")
plt.xlabel("Diagnosis")
plt.ylabel("Count")
plt.title("Diagnosis Distribution")
plt.show()
[Figure: bar chart of diagnosis counts]

Feature Scaling¶

In [23]:
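# StandardScaler is already imported at the top of the notebook.
# NOTE: the scaler is fitted on all rows of x, including those that later
# end up in the test split, so test-set statistics influence the scaling.
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)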
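One way to keep test rows out of the scaling statistics is to put the scaler and the classifier into a single scikit-learn pipeline and fit it on the training split only. The sketch below is an assumption-laden illustration: it uses scikit-learn's bundled copy of the Wisconsin Diagnostic data in place of `breast-cancer.csv`. That copy has no `id` column and codes malignant as 0 and benign as 1, the reverse of this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled dataset used as a stand-in (assumption); target: 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler inside the pipeline is fitted on X_train only when fit() is called;
# the same training statistics are then reused to transform X_test.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale"))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The pipeline object can be passed to `GridSearchCV` or `cross_val_score` as well, in which case the scaler is refitted inside every fold.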

Visualize decision boundary using 2D PCA¶

In [31]:
# Step 2: PCA on full scaled data
pca = PCA(n_components=2)
X_2d = pca.fit_transform(x_scaled)
In [32]:
# Step 3: Now split PCA data and labels together

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)

X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(X_2d, y, test_size=0.2, random_state=42)
In [37]:
# Step 4: Train SVMs on 2D data

svm_linear_2d = SVC(kernel="linear", C=1).fit(X_train_2d, y_train_2d)
svm_rbf_2d = SVC(kernel="rbf", gamma="scale", C=1).fit(X_train_2d, y_train_2d)

# Step 5: Plot decision boundary
def plot_decision_boundary(model, X, y, title):
    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    plt.title(title)
    plt.xlabel("PCA1")
    plt.ylabel("PCA2")
    plt.show()


plot_decision_boundary(svm_linear_2d, X_test_2d, y_test_2d, "SVM Linear Kernel (2D PCA)")
plot_decision_boundary(svm_rbf_2d, X_test_2d, y_test_2d, "SVM RBF Kernel (2D PCA)")
[Figures: decision boundaries of the linear-kernel and RBF-kernel SVMs on the 2D PCA test data]
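A 2D decision-boundary plot is only as informative as the projection behind it, so it is worth checking how much variance the two principal components retain. A small sketch on synthetic data (five features generated from two latent factors plus noise, an assumption made purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 features driven by 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(300, 5))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# explained_variance_ratio_ holds the variance fraction captured by each component.
retained = pca_demo.explained_variance_ratio_.sum()
print("variance per component:", pca_demo.explained_variance_ratio_.round(3))
print(f"retained by 2 components: {retained:.3f}")
```

In this notebook the equivalent check would be `pca.explained_variance_ratio_.sum()` on the fitted `pca` object. A low value would mean the plotted boundaries describe only a small part of what the full-dimensional model sees.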

Evaluation:¶

In [38]:
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.001, 0.01, 0.1]
}

# Create GridSearchCV
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(x_train, y_train)

print("Best Parameters:", grid.best_params_)
Best Parameters: {'C': 10, 'gamma': 0.01}
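By default `GridSearchCV` selects parameters by accuracy, which can favour a model that ignores a rare class. A possible variation, sketched here under assumptions, tunes a scaler-plus-SVC pipeline for recall on the malignant class and uses `class_weight="balanced"`. The data is scikit-learn's bundled copy of the dataset, recoded so that 1 means malignant as in this notebook; the step names `scale` and `svc` are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = data.data
y_malignant = (data.target == 0).astype(int)  # recode: 1 = malignant, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y_malignant, test_size=0.2, random_state=42, stratify=y_malignant
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.001, 0.01, 0.1],
}

# scoring="recall" ranks candidates by recall on the positive (malignant) class.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best CV recall:", round(search.best_score_, 3))
```

Which metric to optimise is a design decision; for a screening task, missing a malignant case is usually costlier than a false alarm, which is the reasoning behind choosing recall here.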
In [47]:
y_pred = grid.predict(x_test)
In [48]:
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
Accuracy: 0.9148936170212766
In [49]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n",cm )

plt.figure(figsize=(10,6))
sns.heatmap(cm, annot=True, linewidths=2, fmt="d", cmap="Blues", xticklabels=["Benign", "Malignant"], yticklabels=["Benign", "Malignant"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()
Confusion Matrix:
 [[43  0]
 [ 4  0]]
[Figure: confusion matrix heatmap]
In [50]:
# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))
Classification Report:
               precision    recall  f1-score   support

           0       0.91      1.00      0.96        43
           1       0.00      0.00      0.00         4

    accuracy                           0.91        47
   macro avg       0.46      0.50      0.48        47
weighted avg       0.84      0.91      0.87        47
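The numbers in the report can be recomputed by hand from the confusion matrix, which makes it easier to see what the 0.91 accuracy actually reflects. A short worked example using only the matrix printed above:

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predictions.
cm = np.array([[43, 0],
               [4, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall_malignant = tp / (tp + fn)

# Score of a model that always predicts the most frequent actual class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy            : {accuracy:.4f}")
print(f"malignant recall    : {recall_malignant:.4f}")
print(f"always-benign score : {majority_baseline:.4f}")
```

The accuracy equals the always-benign baseline because every test sample was predicted as benign; all 4 malignant cases were missed.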

In [51]:
cv_scores = cross_val_score(grid.best_estimator_, x_scaled, y, cv=5)
print("Cross-Validation Accuracy: {:.2f} ± {:.2f}".format(cv_scores.mean(), cv_scores.std()))
Cross-Validation Accuracy: 0.97 ± 0.02
In [52]:
breast_dataset
Out[52]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 ... 25.380 17.33 184.60 2019.0 0.16220 0.66560 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 ... 24.990 23.41 158.80 1956.0 0.12380 0.18660 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 ... 23.570 25.53 152.50 1709.0 0.14440 0.42450 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 ... 14.910 26.50 98.87 567.7 0.20980 0.86630 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 ... 22.540 16.67 152.20 1575.0 0.13740 0.20500 0.4000 0.1625 0.2364 0.07678
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 926424 M 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 ... 25.450 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115
565 926682 M 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 ... 23.690 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637
566 926954 M 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 ... 18.980 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820
567 927241 M 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 ... 25.740 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400
568 92751 B 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 ... 9.456 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039

569 rows × 32 columns

In [53]:
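# NOTE: despite the name, these are raw measurements, not standardized values,
# while the model was trained on StandardScaler output.
new_data_scaled = [14.5, 20.5, 95.5, 600, 0.1, 0.2, 0.3, 0.1, 0.2, 0.07,
                   0.3, 1.2, 2.0, 25.0, 0.01, 0.04, 0.05, 0.02, 0.02, 0.01,
                   16.0, 30.0, 120.0, 900, 0.14, 0.4, 0.5, 0.2, 0.3, 0.08]

# Pads the vector to the 31 inputs the model expects. The padded 0 lands in the
# last position, whereas the extra training column (`id`) is the first one.
new_data_scaled.append(0)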

new_data_scaled_reshaped = [new_data_scaled]

new_prediction = grid.best_estimator_.predict(new_data_scaled_reshaped)
print(new_prediction)
[0]
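For predictions on new cases, two things need to line up with training: the column order and the scaling. A sketch of one way to enforce both, assuming scikit-learn's bundled copy of the dataset as a stand-in (its columns are named `mean radius`, `mean texture`, and so on, with no `id` column) and reusing C = 10 and gamma = 0.01 from the grid search output above. The new case itself is hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer(as_frame=True)
X = data.data                        # 30 named feature columns, no id column
y = (data.target == 0).astype(int)   # recode: 1 = malignant, 0 = benign

# The pipeline applies the training-set scaling automatically at predict time.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma=0.01))
model.fit(X, y)

# Hypothetical new case: start from the column medians, override a few values by name.
new_case = X.median().to_dict()
new_case.update({"mean radius": 14.5, "mean texture": 20.5, "mean area": 600.0})

# Building a DataFrame and reindexing by X.columns enforces the training column order.
new_frame = pd.DataFrame([new_case])[X.columns]

prediction = model.predict(new_frame)
print(prediction)
```

Passing values by column name rather than as a bare list avoids silent misalignment between features, and keeping the scaler inside the model removes the need to remember a separate `scaler.transform` call.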

Conclusion

In this analysis, we built a Support Vector Machine (SVM) model to classify breast cancer tumors as malignant or benign using the Breast Cancer Wisconsin dataset. The dataset was standardized and visualized using PCA, enabling us to plot decision boundaries for both linear and RBF kernels in two dimensions.

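We then performed hyperparameter tuning using GridSearchCV, which selected C = 10 and gamma = 0.01. The final model reached a test accuracy of 0.91 and a cross-validation accuracy of 0.97 ± 0.02. However, the confusion matrix shows that all 47 test samples were predicted as benign, so none of the 4 malignant cases was detected (recall of 0.00 for the malignant class). The accuracy therefore mostly reflects the class imbalance left after outlier removal (219 benign versus 14 malignant samples).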

Key Takeaways:

PCA helped visualize class separability in 2D.

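Only the RBF kernel SVM was tuned and scored; the linear model was fitted for the 2D visualization but not evaluated numerically, so the two kernels are not directly compared here.

With zero recall on malignant tumors, the model in its current form is not yet suitable for early detection tasks in healthcare. The outlier-removal step and the resulting class imbalance need to be addressed first.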


Thank you¶
